Fix flaky maxSubmitRetries test #227

Merged: 4 commits merged into share:master from alecgibson:fix-flaky-test on Jul 23, 2018

Conversation

alecgibson (Collaborator):

This is a slightly speculative fix for a test that fails intermittently
on `sharedb-mongo`. I believe these intermittent failures are due to a
race condition in a concurrency test.

The test works by attempting to fire two commits off at the same time,
and hoping that one of them is committed just before the other, so that
a `SubmitRequest.retry` is triggered whilst `maxSubmitRetries` is
set to `0`, resulting in the expected error.

However, I believe it's possible for these commits to (in some cases)
happen sequentially rather than concurrently, and fail to error.

This change attempts to force them into the retry condition (see the sketch after this list) by:

- Catching both ops in the `commit` middleware, _just_ before they're
  about to be committed (and hit a `retry` if applicable)
- Waiting until both ops have reached this state
- Triggering the first op's `commit`
- Then, in the callback of that op, triggering the second op's `commit`
- The second op should now find that the first op has beaten it to
  committing, and trigger a `retry`
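
Roughly, the sequencing looks like the sketch below. The `docCallback`/`doc2Callback` names and the `{p: ['age'], na: 2}` op appear in the diff discussed later in this thread; doc2's op value, the `request.op.op` access, and the final assertion are illustrative assumptions, not the PR's exact code.

```js
// Sketch only: hold both ops in the 'commit' middleware, then release them
// in a fixed order so doc's op always commits before doc2's.
var docCallback;
var doc2Callback;
backend.use('commit', function (request, next) {
  // Let non-op commits (e.g. an initial create) straight through.
  if (!request.op.op) return next();
  // Assumption: the op components are on request.op.op, so the two
  // submissions can be told apart by their 'na' values.
  if (request.op.op[0].na === 2) docCallback = next;
  else doc2Callback = next;
  // Once both ops are paused just before committing, release the first one.
  if (docCallback && doc2Callback) docCallback();
});

doc.submitOp({p: ['age'], na: 2}, function (error) {
  if (error) return done(error);
  // doc's op has now been committed; releasing doc2's op forces it into
  // SubmitRequest.retry, which errors because maxSubmitRetries is 0.
  doc2Callback();
});
doc2.submitOp({p: ['age'], na: 5}, function (error) {
  // Illustrative assertion: the real test checks for the retry error.
  expect(error).to.be.ok();
  done();
});
```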

@coveralls commented Jul 19, 2018:

Coverage increased (+0.02%) to 96.514% when pulling e9ff681 on alecgibson:fix-flaky-test into 762e05d on share:master.

@coveralls commented:

Coverage increased (+0.6%) to 97.101% when pulling 8332411 on alecgibson:fix-flaky-test into 762e05d on share:master.

Alec Gibson added 2 commits July 19, 2018 15:55
@ericyhwang (Contributor) left a comment:

Good sleuthing!

I believe that the second client's commit will hit the retry case as long as both clients get the same snapshot back from backend.db.getSnapshot in SubmitRequest.submit:

https://github.com/share/sharedb/blob/v1.0.0-beta.8/lib/submit-request.js#L45

Delaying the sending of commits is a pretty neat way of doing it, and I like how clean it ended up being too. 👍
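
The race behind that retry can be pictured with the simplified sketch below. This is not ShareDB's actual submit-request.js code; the function name, the collection/id parameters, and the reduced getSnapshot/commit signatures are illustrative only.

```js
// Simplified sketch of why identical snapshots force a retry: both clients
// read the same snapshot version, so whichever commit lands second is
// rejected and retries, and with maxSubmitRetries = 0 that retry errors out.
function sketchSubmit(db, collection, id, op, retriesLeft, callback) {
  db.getSnapshot(collection, id, function (error, snapshot) {    // reduced signature
    if (error) return callback(error);
    op.v = snapshot.v;                      // both submits can observe the same version here
    db.commit(collection, id, op, function (error, succeeded) {  // reduced signature
      if (error) return callback(error);
      if (succeeded) return callback();     // the first op to commit wins
      if (retriesLeft <= 0) {
        return callback(new Error('max submit retries exceeded (sketch)'));
      }
      sketchSubmit(db, collection, id, op, retriesLeft - 1, callback); // otherwise retry
    });
  });
}
```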

});
doc.submitOp({p: ['age'], na: 2}, function (error) {
  if (error) return done(error);
  doc2Callback();
ericyhwang (Contributor):

I think you can move this up to just below docCallback(), as by that point both clients will have already fetched the snapshot in submit and gone through apply.

Having the doc2Callback() call up there makes it a bit more obvious that the middleware is deferring the first callback until the second one is ready, and calling both in sequence.

alecgibson (Collaborator, Author):

Hmm. But isn't the commit hook just before they've been committed to the db? ie there's still a potential race condition if one of them gets committed before the other. That is, we don't know which callback will finish first, and get forced into retry...? This way we guarantee that doc will always be committed before doc2.

alecgibson (Collaborator, Author):

(That could be mitigated by putting back the old shared callback with a count, but I find that sort of thing a bit sketchy; it's like admitting that you're not really in control of your own test.)

ericyhwang (Contributor):

> That is, we don't know which callback will finish first, and get forced into retry...? This way we guarantee that doc will always be committed before doc2.

> (That could be mitigated by putting back the old shared callback with a count, but I find that sort of thing a bit sketchy; it's like admitting that you're not really in control of your own test.)

Got it, you want to sequence the actual commits so you know 1 goes before 2. Makes sense, thanks for explaining!

done();
var docCallback;
var doc2Callback;
backend.use('commit', function (request, callback) {
ericyhwang (Contributor):

I and future contributors (like future you!) would appreciate it if you added some comments here about why this is needed.

alecgibson (Collaborator, Author):

Done

@ericyhwang (Contributor) commented:

Forgot to drop this note for the future - the failing test and its output:
https://travis-ci.org/share/sharedb-mongo/jobs/405812496#L1163

  1) db client submit submits fail above the backend.maxSubmitRetries threshold:
     Uncaught Error: expected undefined to be truthy
      at Assertion.assert (node_modules/expect.js/index.js:96:13)
      at Assertion.ok (node_modules/expect.js/index.js:115:10)
      at cb (node_modules/sharedb/test/client/submit.js:525:25)
      at callEach (node_modules/sharedb/lib/client/doc.js:905:7)
      at Doc._clearInflightOp (node_modules/sharedb/lib/client/doc.js:891:16)
      at Doc._opAcknowledged (node_modules/sharedb/lib/client/doc.js:836:8)
      at Doc._handleOp (node_modules/sharedb/lib/client/doc.js:286:10)
      at Connection.handleMessage (node_modules/sharedb/lib/client/connection.js:243:20)
      at StreamSocket.socket.onmessage (node_modules/sharedb/lib/client/connection.js:125:18)
      at node_modules/sharedb/lib/stream-socket.js:59:12
      at _combinedTickCallback (internal/process/next_tick.js:131:7)
      at process._tickCallback (internal/process/next_tick.js:180:9)

The failure can be reproduced artificially by wrapping the doc2.submitOp call in a very short setTimeout, to simulate the doc1 commit going through very quickly or the doc2 commit's snapshot fetch being a bit slow.
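
For reference, that artificial reproduction would look roughly like the sketch below; doc2's op value and the assertion shape are assumptions, not the test's exact code.

```js
// Delaying doc2's submit lets doc's commit finish first, so doc2 fetches a
// fresh snapshot, never enters the retry path, and the expected error never
// appears (hence "expected undefined to be truthy").
doc.submitOp({p: ['age'], na: 2}, function (error) {
  if (error) return done(error);
});
setTimeout(function () {
  doc2.submitOp({p: ['age'], na: 5}, function (error) {
    expect(error).to.be.ok(); // now reliably fails: error is undefined
    done();
  });
}, 5);
```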

@ericyhwang (Contributor) left a comment:

Thanks for adding the explanations to the test!

I can merge this, but I can't publish it to NPM since I'm not a maintainer on the sharedb NPM package.

@nateps - Can you please handle merging, versioning, and publishing this? This will unblock some other PRs on e.g. sharedb-mongo that include and run this test.

.travis.yml (Outdated)

@@ -5,6 +5,6 @@ node_js:
   - "8"
   - "6"
   - "4"
-script: "npm run jshint && npm run test-cover"
+script: "ln -s .. node_modules/sharedb; npm run jshint && npm run test-cover"
ericyhwang (Contributor):

Dropping another note for the future: this is to work around the circular dependency issue between sharedb and sharedb-mingo-memory:
#226 (comment)

I'm OK with this workaround for now to unblock the tests, since it's isolated to Travis, until we figure out a better solution in the linked issue above.

ericyhwang (Contributor):

Actually, maybe it's better to get #226 merged first, as it also updates the Travis config's Node versions to reflect the latest + LTS versions.

alecgibson (Collaborator, Author):

Sure. Happy to wait and rebase. Would just be nice to get this through to unblock other PRs.

@nateps (Contributor) commented Jul 23, 2018:

Thanks for the contribution! Fixing flakey tests always brings me great joy. 💥

nateps merged commit 632fad3 into share:master on Jul 23, 2018.
alecgibson deleted the fix-flaky-test branch on July 23, 2018 at 20:52.